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Abstract 



,—1 We present MuMax, a general-purpose micromagnetic simulation tool running on Graphical Processing Units (GPUs). MuMax 
I is designed for high performance computations and specifically targets large simulations. In that case speedups of over a factor 
lOOx can easily be obtained compared to the CPU-based OOMMF program developed at NIST. MuMax aims to be general and 
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of running extensive simulations that would be nearly inaccessible with typical CPU-based simulators. 
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broadly applicable. It solves the classical Landau-Lifshitz equation taking into account the magnetostatic, exchange and anisotropy 
interactions, thermal effects and spin-transfer torque. Periodic boundary conditions can optionally be imposed. A spatial discretiza- 
tion using finite differences in 2 or 3 dimensions can be employed. MuMax is publicly available as open source software. It can 
thus be freely used and extended by community. Due to its high computational performance, MuMax should open up the possibility 
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Micromagnetic simulations are indispensable tools in the 
field of magnetism research. Hence, micromagnetic simulators 
like, e.g., OOMMF magpar |2| and Nmag [3| are widely 
used. These tools solve the Landau-Lifshitz equation on reg- 
ular CPU hardware. Due to the required fine spatial and tem- 
poral discretizations, such simulations can be very time con- 
suming. Limited computational resources therefore often limit 
the full capabilities of the otherwise successful micromagnetic 
approach. 

There is currently a growing interest in running numerical 
calculations on graphical processing units (GPUs) instead of 
CPUs. Although originally intended for purely graphical pur- 
poses, GPUs turn out to be well-suited for high-performance, 
general-purpose calculations. Even relatively cheap GPUs can 
perform an enormous amount of calculations in parallel. E.g., 
the nVIDIA GTX580 GPU used for this work costs less than 
$ 500 and delivers 1 .5 trillion floating-point operations (Flops) 
per second, about 2 orders of magnitude more than a typical 
CPU. 

However, in order to employ this huge numerical power 
programs need to be written specifically for GPU hardware, 
using the programming languages and tools provided by the 
GPU manufacturer, and the code also needs to handle many 
hardware-specific technicalities. Additionally, the used algo- 
rithms need to be expressed in a highly parallel manner, which 
is not always easily possible. 

Other groups have already implemented micromagnetic sim- 
ulations on GPU hardware and report considerable speedups 



compared to a CPU-only implementation f?! '5^, '61 . At the time 
of writing, however, none of these implementations is freely 
available. MuMax, on the other hand, is available as open 
source software and can be readily used by anyone. Its perfor- 
mance also compares favorably to these other implementations. 



2. Methods 

Since the micromagnetic theory describes the magnetization 
as a continuum field M(r, f), the considered magnetic sample is 
discretized in cuboidal finite difference (FD) cells with a uni- 
form magnetization. The time evolution of the magnetization 
in each cell is given by the Landau-Lifshitz equation 
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Here, Ms is the saturation magnetization, y the gyromagnetic 
ratio and a the damping parameter. The continuum effective 
field Hpyy has several contributions that depend on the magne- 
tization, the externally applied field and the material parameters 
of the considered sample. When timestepping equation ([TJ the 
effective field is evaluated several times per time step. Hence, 
the efficiency of micromagnetic software depends on the effi- 
cient evaluation of the different effective field terms at the one 
hand and the application of efficient time stepping schemes on 
the other hand. MuMax combines both with the huge compu- 
tational power of GPU hardware. 
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2.1. Effective field terms 

In the present version of MuMax, the effective field can have 
5 different contributions: the magnetostatic field, the exchange 
field, the applied field, the anisotropy field and the thermal field. 
In what follows we present these terms and comment on their 
optimized implementation. 

2.1.1. Magnetostatic field 

The magnetostatic field H,„s accounts for the long-range in- 
teraction throughout the complete sample 
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Since the magnetostatic field in one FD cell depends on the 
magnetization in all other FD cells, the calculation of H„„ is 
the most time-consuming part of a micromagnetic simulation. 
The chosen method for this calculation is thus decisive for the 
performance of the simulator Therefore, we opted for a fast 
Fourier transform (FFT) based method. In this case, the convo- 
lution structure of (j2]i is exploited. By applying the convolution 
theorem, the convolution is accelerated by first Fourier trans- 
forming the magnetization, then multiplying this result with the 
Fourier-transform of the convolution kernel and finally inverse 
transforming this product to obtain the magnetostatic field. The 
overall complexity of this method is 0(N log A^), as it is domi- 
nated by the FFTs. 

Methods with even lower complexity exist as well. The fast 
multipole method, e.g., only has complexity 0(N), but with 
such a large pre-factor that in most cases the FFT method re- 
mains much faster |7 |. 

A consequence of the FFT method is that the magnetic mo- 
ments must lie on a regular grid. This means that a finite differ- 
ence (FD) spatial discretization has to be used: space is divided 
into equal cuboid cells. This method is thus most suited for 
rectangular geometries. Other shapes have to be approximated 
in a staircase-like fashion. However, thanks to the speedup of- 
fered by MuMax's, smaller cells may be used to improve this 
without excessive performance penalties. 

The possibility of adding periodic boundary conditions in 
one or more directions is also included in the software. This is 
done by adding a sufficiently large number of periodic images 
to the convolution kernel. The application of periodic boundary 
conditions has a positive influence on the computational time 
since the magnetization data does not need to be zero padded in 
the periodic directions, which roughly halves the time spend on 
FFTs for every periodic direction. 

2.7.2. Exchange field 

The exchange interaction contributes to the effective field in 
the form of a laplacian of the magnetization 
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with A the exchange stiffness. In discretized form, this can 
be expressed as a linear combination of the magnetization of 
a cell and a number of its neighbors. MuMax uses a 6-neighbor 



scheme, similar to |,8J. In the case of 2D simulations (only one 
FD cell in the z-direction), this method automatically reduces 
to a 4-neighbor scheme. 

The exchange field calculation is included in the magneto- 
static field routines by simply adding the kernel describing the 
exchange interaction to the magnetostatic kernel. In this way, 
the exchange calculation is essentially free, as only one joint 
convolution product is needed to simultaneously evaluate both 
the magnetostatic and exchange fields. Moreover, by introduc- 
ing the exchange contribution in the magnetostatic field kernel 
periodic boundary conditions are directly accounted for if ap- 
pUcable. 



2.1.3. Other effective field terms 

Next to the above mentioned interaction terms and the ap- 
plied field contribution, MuMax provides the ability to include 
magnetocrystalline anisotropy. Currently, uniaxial and cubic 
anisotropy are available. The considered anisotropy energies 
are 
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and 



<Pani(r) = Ki [a](r)al(r) + al(r)al(r) + a](r)al(r)\ 
+ K2 [al(r)al(r)al(r)] 



(5) 



for uniaxial and cubical anisotropy respectively. Here, Ku 
and (Ki , K2) are the uniaxial and cubical anisotropy constants, 
is the angle between the local magnetization and uniaxial 
anisotropy axis and a,- (/ = 1, 2, 3) are the direction cosines be- 
tween the local magnetization and the cubic easy magnetization 
axes. 

Furthermore, thermal effects are included by means of a fluc- 
tuating thermal field 
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which is added to the effective field ^eff according to Q. In 
(|6]l, ks is the Boltzmann constant, V is the volume of a FD 
cell, 6t is the used time step and j/(r, t) is a stochastic vector 
whose components are Gaussian random numbers, uncorrelated 
in space and time with zero mean value and dispersion 1 . 



2.1.4. Spin-transfer torque 

The spin-transfer torque interaction describes the influence 
of electrical currents on the local magnetization. Possible ap- 
plications are spin-transfer torque random access memory [IQl 
and racetrack memory |ir|. MuMax incorporates the spin- 
transfer torque description developed by Berger lfT2]| . refined 
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by Zhang and Li |fT3l 
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Here, ^ is the degree of non-adiabicity and bj is the coupling 
constant between the current density j and the magnetization 
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with P the polarization of the current density, hb the Bohr mag- 
neton and e the electron charge. 

2.2. Time integration schemes 

MuMax provides a range of Runge-Kutta (RK) methods to 
integrate the Landau-Lifshitz equation. Currently the user can 
select between the following options: 

• RKl: Euler's method 

• RK2: Heun's method 

• RKl 2: Heun-Euler (adaptive step) 

• RK3: Kutta's method 

• RK23: Bogacki-Shampine (adaptive step) 

• RK4: Classical Runge-Kutta method 

• RKCK: Cash-Karp (adaptive step) 

• RKDP: Dormand-Prince (adaptive step) 

The adaptive step methods adjust the time step based on a 
maximum tolerable error per integration step that can be set by 
the user. The other methods can use either a fixed time step 
or a fixed maximum variation of m per step. Depending on 
the needs of the simulation, a very accurate but relatively slow 
high-order solver (e.g. RKDP) or a less accurate but fast solver 
(e.g. RK12) can be chosen. Additionally, MuMax incorporates 
the semi-analytical methods described in lfT4l . These methods 
are specifically tailored to the Landau-Lifshitz equation. 

3. GPU-optimized implementation 

Since various CPU based micromagnetic tools — well suited 
for relatively small micromagnetic problems — are already 
available, we mainly concentrated on optimizing MuMax for 
running very large simulations on GPUs. Nevertheless, the 
code can also run in CPU-mode, with multi-threading modali- 
ties enabled. In this way one can get familiar with the capabili- 
ties of MuMax before a high-end GPU has to be purchased. 



The GPU-specific parts of MuMax have been developed us- 
ing nVIDIA's CUDA platform. The low-level, performance- 
critical functions that have to interact directly with the GPU are 
written in C/C+-H. Counterparts of these functions for the CPU 
are implemented as well and use FFTW ifTSl and multi thread- 
ing. The high-level parts of MuMax are implemented in "safe" 
languages including Java, Go and Python. This part is indepen- 
dent of the underlying GPU/CPU hardware. In what follows 
we will only elaborate on the GPU-optimized implementation 
of the low-level functions. 

3.1. General precautions 

A high-end GPU has its own dedicated memory with a high 
bandwidth (typically a few hundred GB/s) which enables fast 
reads and writes on the GPU itself. Communication with the 
CPU on the other hand is much slower since this takes place 
over a PCI express bus with a much lower bandwidth (typically 
a few GB/s). Therefore, our implementation keeps as much data 
as possible in the dedicated GPU memory, avoiding CPU-GPU 
communication. The only large data transfers occure at initial- 
ization time and when output is saved to disk. The CPU thus 
only instructs the GPU which subroutines to launch. Hence, the 
GPU handles all the major computational jobs. 

On the GPU, an enormous number of threads can run in par- 
allel, each performing a small part of the computations. E.g., 
the GTX580 GPU used for this work has 512 computational 
cores grouped in 16 multiprocessors, resulting in total number 
of 16384 available parallel threads. However, this huge parallel 
power is only optimally exploited when the code is adapted to 
the specific GPU architecture. E.g.: threads on the same multi- 
processor ("thread warps") should only access the GPU mem- 
ory in a coalesced way and should ideally perform the same in- 
structions. When coalesced memory access in not possible, the 
so-called shared memory should be used instead of the global 
GPU memory. This memory is faster and has better random- 
access properties but is very scarce. Our implementation takes 
into account all these technicalities, resulting in a very high per- 
formance. 

3.2. GPU -optimization of the convolution product 

Generally, most computational time goes to the evaluation 
of the convolution product defined by the magnetostatic field. 
When using fast Fourier transforms (FFTs), the computations 
enhance three different stages: (i) forward Fourier transform- 
ing the magnetization data that is zero padded in the non- 
periodic directions, (ii) point-by-point multiplying the obtained 
data with the Fourier transformed magnetostatic field kernel, 
(iii) inverse Fourier transforming the resulting magnetostatic 
field data. The carefull implementation of these three stages 
determines the efficiency of the convolution product and, more 
general, of the micromagnetic code. 

In the first place, the efficiency of this convolution process 
is safeguarded by ensuring that the matrices defined by the 
magnetostatic field kernel are completely symmetrical. Conse- 
quently, the Fourier transformed kernel data is purely real. The 
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absence of the imaginary part leads to smaller memory require- 
ments as well as a much faster evaluation of the point-by-point 
multiplications - step (ii). 

Furthermore, our GPU implementation of the fast Fourier 
transforms, which internally uses the CUDA "CUFFT" library, 
is specifically optimized for micromagnetic applications. The 
general 3D real-to-complex Fourier transform (and its inverse) 
available in the CUFFT library is replaced by a more eflicient 
implementation in which the set of ID transforms in the dif- 
ferent directions are performed separately. This way, Fourier 
transforms on arrays containing only zeros resulting from the 
zero padding are avoided. In each dimension, the set of ID 
Fourier transforms are performed on contiguous data points re- 
sulting in the coalesced reading and writing of the data. As a 
drawback, the transposition of the data between a set of Fourier 
transforms in one and another dimension is needed. 

In a straight forward implementation of the required matrix 
transposes, the read and write instructions can not be both per- 
formed in a coalesced way since either the input data or the 
transposed data is not contiguous in global memory. Therefore, 
the input data is divided in blocks and copied to shared mem- 
ory assigned to a predefined number of GPU threads. There, the 
data block is transposed and copied in a coalesced way back to 
the global GPU memory space. By inventively using the large 
number of zero arrays -in the non-periodic case- this trans- 
pose process can be done without (for 2D) or with only limited 
(3D) extra memory requirements. The different sets of Fourier 
transforms in this approach are performed using the ID FFT 
routines available in the CUFFT library. This implementation 
outperforms the general 3D real-to-complex Fourier transform 
available in the CUFFT library while the built-in 2D real-to- 
complex is only faster for small dimensions (for square geome- 
tries: smaller than 512x512 FD cells). This approach ensures 
the efficient evaluation of steps (i) and (iii) of the convolution. 

3.3. Floating point precision 

GPUs are in general better suited for single-precision than 
double-precision arithmetic. Double-precision performance is 
not only much slower due to the smaller number of arithmetic 
units, but also requires twice the amount of memory. Since 
GPU's typically have Umited memory and FFT methods are 
relatively memory-intensive, we opted to uses single-precision 
exclusively. 

While, e.g., the finite element method used by Kakay et al. 
relies heavily on double precision to obtain an accurate solution 
1^1, our implementation is designed to remain accurate even at 
single precission. First, all quantities are internally stored in 
units that are well adapted to the problem. More specifically, 
we choose units so that f^o - 7o - - A = I. This avoids 
that any other quantity in the simulation becomes exceptionally 
large or small — which could cause a loss of precision due to 
saturation errors. The conversion to and from internal units is 
performed transparently to the user. Secondly, we avoid nu- 
merically instable operations like, e.g., subtracting nearly equal 
numbers. This avoids that small rounding of errors get ampli- 
fied. Finally, and most importantly, we restrict the size of the 
FFTs to numbers where the CUFFT implementation is most 



accurate: 2" x {1, 3, 5 or 7). Hence sometimes a slightly larger 
number of FD cells than strictly necessary is used to meet this 
requirement. Fortunately this has no adverse effect on the per- 
formance since CUFFT FFTs with these sizes also happen to be 
exceptionally fast (see below). In this way, the combined error 
introduced by the forward-i-inverse FFT was found to be only of 
the order of (9(10^^), as opposed to a typical error of (9(10 "^) 
for other FFT sizes (estimated from the error on transforming 
random data back and forth). Thanks to these precautions we 
believe that our implementation should be sufficiently accurate 
for most practical applications. Indeed, the uncertainty on ma- 
terial parameters alone is usually much larger than the FFT er- 
ror of 10"^ 

4. Validation 

In order to validate our software, we tested the reliability 
of the code by simulating several standard problems. These 
standard problems are constructed such that all different contri- 
butions in the considered test case influence the magnetization 
processes significantly. A correct simulation of standard prob- 
lems can be considered as the best possible indication of the 
validity of the developed software. In what follows, we con- 
sider standard problems constructed for testing static simula- 
tions, dynamic simulations and dynamic simulations incorpo- 
rating spin-transfer torque. 

4.1. static standard problem 

The //MAG standard problem #2 fT6\ aims at testing quasi 
static simulations. A cuboid with dimensions 5d x d x O.ld 
is considered. Since only magnetostatic and exchange interac- 
tions are included, the resulting static properties only depend 
on the scaled parameter d/l^.^, with the exchange length. 
The starting configuration is saturation along the [1,1,1] axis, 
which is relaxed to the remanent state. This was done by solv- 
ing the Landau-Lifshitz equation with a high damping parame- 
ter a = 1 . 

The number of FD cells was chosen depending on the size 
of nanostructure, making sure the cell size remained below the 
exchange length. For c//Z„ < 10, single-domain states with 
nearly full saturation along the long axis were found, while for 
large geometries an S-state occured. The MuMax simulations 
considered 200 values for d/l^^-. On the GPU a total simulation 
time of 3'21" was needed to complete the 200 individual simu- 
lations, compared to 34' 30" on the CPU. The GPU speedup is 
here Umited by the relatively smaU simulation sizes (cfr. Fig. 

Figure [T] shows the remanent magnetization in function of 
the ratio d/lgx- The values obtained with MuMax coincide 
well with those of other authors 1.17. J8J . validating MuMax 
for static micromagnetic problems. 

4.2. dynamic standard problem 

The pMAG standard problem #4 lfT6l aims at testing the de- 
scription of the dynamic magnetization processes by consider- 
ing the magnetization reversal in a thin film with dimensions 
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Figure 1: Standard problem #2. Remanent magnetization along the A:-axis (left 
axis) and along the }'-axis (right axis) as a function of the scaling parameter 
d. The full line represents the simulation results from MuMax, while the cir- 
cles represent simulation points obtained from McMichael et al. 1 17 ,1 and from 
Donahue et al. 1181 
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500 nm X 125 nm x 3 nm. Starting from an initial equilibrium 
S-state, two different fields are applied. In this problem only 
the exchange and magnetostatic interactions are considered (ex- 
change stiffness A = 1.3 x 10"'^ J/m, saturation magnetization 
Ms - 8.0 X 10^ Am"'). When relaxing to the initial equilibrium 
S-state, a damping constant equal to 1 is used while during the 
reversal itself a damping constant of 0.02 is applied, according 
to the problem definition. As proposed in the standard problem, 
we show in Figs. [2] and [3] the evolution of the average mag- 
netization components together with the reference magnetiza- 
tion configuration at the time point when < > crosses zero 
for the first time, for field 1 (jUoi/i=-24.6mT, fioHy= 4.3 mT, 
^ioHy- 0.0 mT) and field 2 (jt/()i/j^=-35.5 mT, jioHy- -6.3 mT, 
^qH,- 0.0 mT). A discretization using 128x32x 1 FD cells 
and the RK23 time stepping scheme with a time step around 
600 fs (dynamically adapted during the simulation) was used. 
The relatively large time steps used to solve this standard prob- 
lem demonstrate that MuMax incorporates robust time stepping 
schemes with accurate adaptive step mechanisms. The adap- 
tive step algorithms ensure optimal time step lengths and thus 
reduce the number of field evaluations, speeding up the simu- 
lation. Here, a total time of only 2.5 seconds was needed to 
finish this simulation on the GPU compared to 16 seconds on 
the CPU. In this case the speedup on GPU is limited due to the 
small number of FD cells. When the simulation is repeated with 
a finer discretization of 256 x64 x 2 cells, on the other hand, the 
GPU speedup already becomes more pronounced: the simula- 
tion finishes in 46 seconds on the GPU but takes 20' 32" on the 
CPU. 

From Figs. [2] and |3] it is clear that the results obtained with 
MuMax are well within the spread of the curves obtained by 
other authors. Also the magnetization plots are in close agree- 
ment with those available at the //Mag website lfT6l . 



Figure 2: (top) Time evolution of the average magnetization during the reversal 
considered in /jMag standard problem #4, fieldl. The results obtained with 
MuMax (black) lie well within the spread of the reference solutions (grey), 
taken from 116]. (bottom) Magnetization configuration when < M, > crosses 
the zero magnetization for the first time. 
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Figure 3: (top) Time evolution of the average magnetization during the reversal 
considered in /jMag standard problem #4, field2. This field was chosen to cause 
a bifurcation point to make the different solutions diverge, (bottom) Magneti- 
zation configuration when < > crosses the zero magnetization point for the 
first time. 
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Figure 4: Time evolution of tlie average in plane magnetization during tlie first 8 
ns of tlie spin-transfer torque standard problem. To facilitate the visual compar- 
ison of our results with Iil9il , the average magnetization is expressed in [A/m] 
and the same axes ratios are chosen. 



4.3. Spin-transfer torque standard problem [19^ 

juMag does not propose any standard problems that include 
spin-transfer torque. Therefore we rely on a standard prob- 
lem proposed by M. Najafi et al. |19| to check the valid- 
ity of the spin-transfer torque description implemented in Mu- 
Max. The standard problem considers a permalloy sample 
(A = 1.3 X 10"" J/m, M, = 8.0 x 10^ Am"') with dimensions 
100 nmx 100 nmx 10 nm. The initial equilibrium magnetiza- 
tion state is a predefined vortex, positioned in the center of the 
sample and relaxed without any spin-transfer torque interaction 
(a = 1.0). Once relaxed, an homogeneous spin-polarized dc 
current j = 10'^ Am"^ along the jc-axis is applied on the sam- 
ple. Now, a is 0.1 and the degree of non-adiabicity ^ is 0.05, 
see expression (|7]). Under these circumstances, the vortex cen- 
ter moves towards a new equilibrium position. The time evolu- 
tion of the average in plane magnetization and the magnetiza- 
tion configuration at t-Q.73 ns are shown respectively in Fig. |4] 
and Fig. |5] The results are in good agreement with those pre- 
sented in reference |19|. With a discretization of 128 x 128 FD 
cells, 10 minutes of simulation time were needed to obtain the 
presented data. 

5. Performance 

The performance of MuMax on the CPU is roughly compa- 
rable to OOMMF. The CPU performance is thus good, but our 
main focus is optimizing the GPU code. Special attention went 
to fully exploiting the numerical power of the GPU while fo- 
cussing on the time- and memory-efficient simulation of large 
micromagnetic problems. Figure |6] shows the time required to 
take one time step with the Euler method (i.e. effective field 
evaluation, evaluation of the LL-equation and magnetization 
update) on CPU (1 core) and on GPU for 2D and 3D simu- 
lations. 

In both the 2D and 3D case, speedups of up two orders of 
magnitude are obtained for large dimensions. For smaller ge- 
ometries, the speedups decrease but remain significant. This 




Figure 5: Magnetization configuration at t=0.73 ns as found during the simula- 
tion of a standard problem incorporating spin-transfer torque 1 19 1. The vortex 
core evolves towards a new equilibrium state under influence of a spin-polarized 
dc current allong the horizontal direction. This figure is rendered with the built- 
in graphics features present in MuMax. 



can be understood by the fact that in these simulations not 
enough FD cells are considered to have all (9(10'*) available 
threads at work at the same time. Hence, the computational 
power is not fully exploited. Furthermore, Fig. [6] shows that 
the CPU performance as well as the GPU performance does 
not follow a smooth curve. This is a consequence of the FFTs 
which are most efficient for powers of two, possibly multiplied 
with one small prime (in the benchmarks shown in Fig. |6] the 
default rounding to these optimal sizes is not performed). In 
the GPU implementation this is even more the case than in the 
CPU implementation. E.g., the 2D simulation with dimensions 
992 X 992 is five times slower than the 2D simulation with di- 
mensions 1024x1024. This shows that not only for accuracy 
reasons, but also for time efficiency reasons, it is most advan- 
tageous to restrict the simulations domain to the optimal di- 
mensions defined by 2" x { 1, 3, 5 or 7}. Because of the extreme 
impact on the performance of MuMax, we opted to standardly 
rescale the size of the FD cells such that the dimensions are 
rounded of to one of these optimal sizes. 

Due to the typical architecture of GPUs and the nature of the 
FFT algorithm, simulations of geometries with power of two 
sizes run extremely fast on GPU. Table [T] gives an overview of 
the speedups for these sizes for the 2D and 3D case and Fig. 
|7] shows the speedup obtained for these power of two sizes by 
MuMax compared to the OOMMF code. Both comparisons re- 
sult in speedups larger than a factor 100. This means that sim- 
ulations that used to take several hours can now be performed 
in minutes. The comparison between the speedups shown in 
Table [T] and Fig. |7]further show that our CPU implementation 
has indeed a comparable efficiency regarding to OOMMF. The 
immense speedups evidence the fact that MuMax can indeed 
open completely new research opportunities in micromagnetic 
modelling. 
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Figure 6: Time required to perform one time step using the Euler metliod for 
(top) 2D geometries with varying dimensions NxN and (bottom) 3D geome- 
tries with varying dimensions NxNxN. The CPU computations are performed 
on a 2.8 GHz intel core i7-930 processor, while the GPU computations are per- 
formed on nVIDIA GTX580 GPU hardware. 



Table 1 : Time needed to take one time step with the Euler method on CPU and 
GPU for 2D geometries (top) and 3D geometries (bottom). 
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Figure 7: Speedup obtained with MuMax running on a GTX580 GPU com- 
pared to OOMMF on a 2.8GHz core i7-930 CPU. The 2D and 3D geometiies 
have sizes NxN and NxNxN respectively. The lowest speedup for the 16x16 
X 16 case - an unusually small simulation- is still a factor 4. 



6. How to use MuMax 



MuMax is released as open source software under the GNU 
General Public License (GPL) v. 3 and can thus be freely 
used by the community. In addition to the terms of the 
GPL, we kindly ask to acknowledge the authors in any pub- 
lication or derivative software that uses MuMax, by citing 
this paper. The MuMax source code can be obtained via 
http://dyiiamat.ugent.be/mumax. To use the software, a 
PC with a "CUDA capable" nVIDIA GPU and a recent 64-bit 
Linux installation is required. 

A MuMax simulation is entirely specified by an input file 
passed via the command-line. I.e., once the input file is written, 
no further user interaction is necessary to complete the simu- 
lation. This allows, for instance, to run large batches of simu- 
lations unattended. Nevertheless, the progress of a simulation 
can easily be checked: the number of time steps taken, total 
simulated time, etc. is reported in the terminal, PNG images of 
the magnetization state can be output on-the-fly, graphs of the 
average magnetization can easily be obtained while the simu- 
lation is running, etc. Furthermore, MuMax's output format is 
compatible with OOMMF, enabling the use of existing post- 
processing tools to visualize and analyze the output. Built-in 
tools for output processing are available as well. The 3D vector 
field in Fig. |5] e.g., is rendered by MuMax's tools. 

MuMax input files can be written in Python. This offers pow- 
erful control over the simulation flow and output. As an exam- 
ple, the code snippet below simulates an MRAM element as in 



standard problem 2 (see section 4. 1 1. Starting form a uniform 
state in the +x direction, it scans the field B in small steps un- 
til the point of coercivity. This illustrates how easily complex 
simulations can be defined. 
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from mumax import * 

msat(800e3) 
aexch(1.3e-ll) 

partsize(500e-9, 50e-9, 5e-9) 
iiniformd, 0, 0) 

B = 

while avg_m('x') > 0: 

staticf ield(-B, 0, 0) 

relax () 

B += le-4 
# B now holds the coercitive field 
save Cm', 'bineiry') 

Figure 8: MuMax input file snippet illustrating a simulation specification in 
Python. After initialization, a field in the -x direction is stepped vmtU the aver- 
age magnetization along x reaches zero. The possibility of writing conditional 
statements and loops, and obtaining information like the magnetization state 
allows to construct arbitrarily complex simulation flows. 

7. Conclusions and Outlook 



MuMax is the first GPU-based micromagnetic solver that is 
publicly available as open-source software. Due to the large 
number of considered interaction terms and the versatile geo- 
metrical options (e.g. periodic boundary conditions) the soft- 
ware covers many of the classical micromagnetic research top- 
ics. The code is extensively validated by considering several 
standard problems and is shown to be reliable. The time gains 
are extremely large compared to CPU simulations: for large 
simulations a speedup with a factor 100 is easily obtained. 
These enormous speedups will open up new opportunities in 
micromagnetic modelling and boost fundamental magnetic re- 
search. 

In the future, MuMax will be extended towards a yet more 
multipurpose software package incorporating other interactions 
in the Landau-Lifshitz equation: other expressions for the ex- 
change contribution, exchange bias, magnetoelastic coupling, 
etc. Boundary correction methods to help in approximation 
non-square geometries better are currently also being consid- 
ered. Furthermore, the description of more complex, non uni- 
form microstructures will be made possible as e.g. nanocrys- 
talline materials. Modules for hysteresis research and magnetic 
domain studies will be developed following the presented 2D 
and 3D approach as well as the so-called 2.5D approach (in- 
finitely thick geometries). 

Furthermore, the efficient simulation of yet larger problems 
is planned by introducing multiple GPUs -in one or more 
machines- for one single simulation. This way, the up to now 
limited memory available on the GPU hardware can be circum- 
vented and the computational power will be further increased. 
Apart from requiring efficient communication between the dif- 
ferent GPUs, this should be possible without drastic changes to 
our code as most of the implementation is already hardware- 
independent. 
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